where $\|\cdot\|_2$ denotes $\ell_2$ normalization, and $l$ and $h$ are the layer and head indices.
Previous work shows that matrices constructed in this way capture specific patterns that reflect the semantic understanding of the network [226]. Moreover, the patches encoded from the input images carry a high-level understanding of parts, objects, and scenes [83]. Thus, such a semantic-level distillation target provides fine-grained supervision for the quantized ViT. The corresponding teacher matrices $\tilde{\mathbf{G}}^{l}_{q_h;T}$ and $\tilde{\mathbf{G}}^{l}_{k_h;T}$ are constructed in the same way from the teacher's activations. Thus, combined with the original distillation loss in Eq. (2.17), the final distillation loss is formulated as
$$
\begin{aligned}
\mathcal{L}_{\mathrm{DGD}} &= \sum_{l\in[1,L]}\sum_{h\in[1,H]} \Big( \big\|\tilde{\mathbf{G}}^{l}_{q_h;T} - \tilde{\mathbf{G}}^{l}_{q_h}\big\|_2 + \big\|\tilde{\mathbf{G}}^{l}_{k_h;T} - \tilde{\mathbf{G}}^{l}_{k_h}\big\|_2 \Big),\\
\mathcal{L}_{\mathrm{distillation}} &= \mathcal{L}_{\mathrm{dist}} + \mathcal{L}_{\mathrm{DGD}},
\end{aligned}
\tag{2.23}
$$
where $L$ and $H$ denote the numbers of ViT layers and attention heads, respectively. With the proposed Distribution-Guided Distillation (DGD), Q-ViT retains the query and key distributions of the full-precision counterpart (as shown in Fig. 2.7).
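To make Eq. (2.23) concrete, the following minimal PyTorch sketch computes the DGD loss from per-head query and key activations. It assumes the activations have already been collected (e.g., via forward hooks) from the student and the teacher; the names `similarity_matrix`, `dgd_loss`, `student_qk`, and `teacher_qk` are illustrative, and the batch reduction is an assumption.

```python
import torch
import torch.nn.functional as F


def similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """l2-normalize per-head activations row-wise and build the patch
    similarity (Gram) matrix, removing scale differences between the
    student and the teacher."""
    x = F.normalize(x, p=2, dim=-1)        # (B, H, N, D)
    return x @ x.transpose(-2, -1)         # (B, H, N, N)


def dgd_loss(student_qk, teacher_qk):
    """student_qk / teacher_qk: lists over layers of (query, key) tensors,
    each of shape (batch, heads, tokens, head_dim)."""
    loss = 0.0
    for (q_s, k_s), (q_t, k_t) in zip(student_qk, teacher_qk):
        gq_s, gk_s = similarity_matrix(q_s), similarity_matrix(k_s)
        gq_t, gk_t = similarity_matrix(q_t), similarity_matrix(k_t)
        # l2 (Frobenius) norm of the per-head differences, summed over
        # heads and averaged over the batch (reduction is an assumption).
        loss = loss + torch.linalg.norm(gq_t - gq_s, dim=(-2, -1)).sum(dim=1).mean()
        loss = loss + torch.linalg.norm(gk_t - gk_s, dim=(-2, -1)).sum(dim=1).mean()
    return loss
```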
Our DGD scheme first provides a distribution-aware optimization direction by selecting appropriate parameters to distill. It then constructs similarity matrices to eliminate scale differences and numerical instability, thereby improving the fully quantized ViT through more accurate optimization.
2.3.5 Ablation Study
Datasets. The experiments are carried out on the ILSVRC12 ImageNet classification dataset [204], which is challenging due to its large scale and high diversity: it contains 1000 classes, 1.2 million training images, and 50k validation images. Our experiments use the standard data augmentation pipeline described in [224].
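As an illustration, a minimal data-pipeline sketch is given below, assuming the DeiT-style augmentation of [224] is built with timm's `create_transform`. The specific hyper-parameter values (RandAugment policy, color jitter, random erasing) are the commonly used DeiT defaults and should be treated as assumptions rather than the exact recipe used here.

```python
from timm.data import create_transform
from torchvision import datasets

# Training-time augmentation in the DeiT style (values are assumptions).
train_transform = create_transform(
    input_size=224,
    is_training=True,
    color_jitter=0.4,                      # color jitter strength
    auto_augment='rand-m9-mstd0.5-inc1',   # RandAugment policy used by DeiT
    interpolation='bicubic',
    re_prob=0.25, re_mode='pixel', re_count=1,  # random erasing
)

# Standard center-crop evaluation transform.
eval_transform = create_transform(input_size=224, is_training=False,
                                  interpolation='bicubic', crop_pct=0.875)

train_set = datasets.ImageFolder('/path/to/imagenet/train', train_transform)
val_set = datasets.ImageFolder('/path/to/imagenet/val', eval_transform)
```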
Experimental settings. In our experiments, we initialize the weights of the quantized model with the corresponding pre-trained full-precision model. The quantized model is trained for 300 epochs with a batch size of 512 and a base learning rate of 2e-4. We do not use a warm-up scheme. We apply the LAMB [275] optimizer with the weight decay set to 0 for all experiments. Other training settings follow DeiT [224] or Swin Transformer [154]. Note that, following [61], the patch embedding (first) layer and the classification (last) layer are quantized to 8 bits.
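A minimal training-setup sketch under these stated settings is shown below. Here `QViT` and `load_fp_checkpoint` are hypothetical stand-ins for the quantized model and its full-precision initializer, the LAMB implementation is assumed to come from timm, and the cosine schedule is an assumption carried over from the DeiT recipe.

```python
import torch
from timm.optim import Lamb  # LAMB implementation assumed from timm

# Hypothetical quantized model, initialized from full-precision weights.
model = QViT(backbone='deit_small')
model.load_state_dict(load_fp_checkpoint(), strict=False)

# LAMB with zero weight decay, base learning rate 2e-4, no warm-up.
optimizer = Lamb(model.parameters(), lr=2e-4, weight_decay=0.0)
# Cosine decay over 300 epochs (schedule is an assumed DeiT-style choice).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    for images, targets in train_loader:   # effective batch size 512
        ...                                 # forward, distillation loss, backward
    scheduler.step()
```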
Backbone. We evaluate our quantization method on two popular vision transformer implementations: DeiT [224] and Swin Transformer [154]. DeiT-S, DeiT-B, Swin-T, and Swin-S are adopted as the backbone models, whose Top-1 accuracies on the ImageNet dataset are 79.9%, 81.8%, 81.2%, and 83.2%, respectively. For a fair comparison, we use the official implementations of DeiT and Swin Transformer.
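For reference, the full-precision backbones can be obtained as in the sketch below, assuming the timm registry names for DeiT-S/B and Swin-T/S; the actual experiments rely on the official repositories.

```python
import timm

# Assumed timm model names for the four backbones.
backbones = {
    'DeiT-S': 'deit_small_patch16_224',
    'DeiT-B': 'deit_base_patch16_224',
    'Swin-T': 'swin_tiny_patch4_window7_224',
    'Swin-S': 'swin_small_patch4_window7_224',
}

# Pre-trained full-precision models used to initialize the quantized networks.
fp_models = {name: timm.create_model(arch, pretrained=True)
             for name, arch in backbones.items()}
```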
We give quantitative results of the proposed IRM and DGD in Table 2.1. As shown in Table 2.1, the fully quantized ViT baseline suffers a severe performance drop on the classification task (0.2%, 2.1%, and 11.7% accuracy drops with 4/3/2 bits, respectively). IRM and DGD improve performance when used alone, and the two techniques enhance performance considerably when combined. For example, IRM improves the 2-bit baseline by 1.7% and DGD achieves a 2.3% improvement; when IRM and DGD are combined, a 3.8% improvement is achieved.
In conclusion, the two techniques reinforce each other to improve Q-ViT and narrow the performance gap between the fully quantized ViT and its full-precision counterpart.